perm filename PROPOS[7,ALS]1 blob sn#032368 filedate 1973-04-04 generic text, type T, neo UTF8
00010								April 3 1973
00020	
00030	 A Proposal for Speech Understanding Research
00040	
00050	
00060		It is proposed that the work on speech recognition that is
00070	now under way in the A.I. project at Stanford University be continued
00080	and extended as a separate project with broadened aims in the field
00090	of speech understanding. This work gives considerable promise both of
00100	solving some of the immediate problems that beset speech
00110	understanding research and of providing a basis for future advances.
00120	
00130		It is further proposed that this work be more closely tied to
00140	the ARPA Speech Understanding Research effort than it has been in the
00150	past and that it have as its express aim the study and application to
00160	speech recognition of a machine learning process that has proved
00170	highly successful in another application and that has already been
00180	tested out to a limited extent in speech recognition. The machine
00190	learning process offers both an automatic training scheme and the
00200	inherent ability of the system to adapt to various speakers and
00210	dialects. Speech recognition via machine learning represents a global
00220	approach to the speech recognition problem and can be incorporated
00230	into a wide class of limited vocabulary systems.
00240	
00250		Finally, we would propose accepting responsibility for keeping
00260	other ARPA projects supplied with operating versions of the best
00270	current programs that we have developed. The availability of the high
00280	quality front end that the signature table approach provides would 
00290	enable designers of the various over-all systems
00300	to test the relative performance of the top-down portions of their
00310	systems without having to make allowances for the deficiencies
00320	of their currently available front ends. Indeed, if the signature table
00330	scheme can be made simple enough to compete on a time basis (and we
00340	believe that it can) then it may replace the other front end
00350	schemes that are currently in favor.
00360	
00370		Stanford University is well suited as the site for such work,
00380	having both the facilities for this work and a staff of people with
00390	experience and interest in machine learning, phonetic analysis, and
00400	digital signal processing.
00410	
00420		Ultimately we would
00430	like to have a system capable of understanding speech from an
00440	unlimited domain of discourse and with an unknown speaker. It seems not
00450	unreasonable to expect the system to deal with this situation very
00460	much as people do when they adapt their understanding processes to
00470	the speaker's idiosyncrasies during the conversation. The signature table
00480	method gives promise of contributing toward the solution of this
00490	problem as well as being a
00500	possible answer to some of the more immediate problems.
00510	
00520		The initial thrust of the proposed work would be toward the
00530	development of adaptive learning techniques, using the signature
00540	table method and some more recent variants and extensions of this
00550	basic procedure. We have already demonstrated the usefulness of this
00560	method for the initial assignment of significant features to the
00570	acoustic signals. One of the next steps will be to extend the method
00580	to include acoustic-phonetic probabilities in the decision process.
00610	
00620		Still another aspect to be studied would be the amount of
00630	preprocessing that should be done and the desired balance between
00640	bottom-up and top-down approaches. It is fairly obvious that
00650	decisions of this sort should ideally be made dynamically depending
00660	upon the familiarity of the system with the domain of
00670	discourse and with the characteristics of the speaker.
00680	Compromises will undoubtedly have to be made in any immediately
00690	realizable system but we should understand better than we now do the
00700	limitations on the system that such compromises impose.
00710	
00720		It may be well at this point to describe the general
00730	philosophy that has been followed in the work that is currently under
00740	way and the results that have been achieved to date. We have been
00750	studying elements of a speech recognition system that is not
00760	dependent upon the use of a limited vocabulary and that can recognize
00770	continuous speech by a number of different speakers.
00780	
00790		Such a system should be able to function successfully either
00800	without any previous training for the specific speaker in question or
00810	after a short training session in which the speaker would be asked to
00820	repeat certain phrases designed to train the system on those phonetic
00830	utterances that seemed to depart from the previously learned norm. In
00840	either case it is believed that some automatic or semi-automatic
00850	training system should be employed to acquire the data that is used
00860	for the identification of the phonetic information in the speech. We
00870	believe that this can best be done by employing a modification of the
00880	signature table scheme previously described. A brief review of this
00890	earlier form of signature table is given in Appendix 1.
00900	
00910		The over-all system is envisioned as one in which the more or
00920	less conventional method is used of separating the input speech into
00930	short time slices, for each of which some sort of frequency analysis
00940	(homomorphic, LPC, or the like) is done. We then interpret this
00950	information in terms of significant features by means of a set of
00960	signature tables. At this point we define longer sections of the
00970	speech called EVENTS which are obtained by grouping together varying
00980	numbers of the original slices on the basis of their similarity. This
00990	then takes the place of other forms of initial segmentation. Having
01000	identified a series of EVENTS in this way we next use another set of
01010	signature tables to extract information from the sequence of events
01020	and combine it with a limited amount of syntactic and semantic
01030	information to define a sequence of phonemes.
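The grouping of slices into EVENTS might be sketched as follows. This is a minimal illustration, not the project's code; the feature vectors, the Euclidean distance measure, and the similarity threshold are all assumptions made for the example.

```python
# Sketch: group successive time slices into EVENTS on the basis of
# their similarity.  Each slice is a feature vector (here standing in
# for signature-table outputs).  Distance measure and threshold are
# illustrative assumptions.

def group_into_events(slices, threshold=2.0):
    """Merge consecutive similar slices into events (lists of slices)."""
    events = []
    for s in slices:
        if events and distance(events[-1][-1], s) <= threshold:
            events[-1].append(s)      # similar: extend the current event
        else:
            events.append([s])        # dissimilar: start a new event
    return events

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

slices = [(0, 0), (0.1, 0.2), (5, 5), (5.1, 4.9), (0, 0)]
events = group_into_events(slices)
# Five slices collapse into three events.
```

This grouping then takes the place of a separate initial segmentation step, as the text notes.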
01040	
01050		While it would be possible to extend this bottom up approach
01060	still further, it seems reasonable to break off at this point and
01070	revert to a top down approach from here on. The real difference in
01080	the overall system would then be that the top down analysis would
01090	deal with the outputs from the signature table section as its
01100	primitives rather than with the outputs from the initial measurements
01110	either in the time domain or in the frequency domain. In the case of
01120	inconsistencies the system could either refer to the second choices
01130	retained within the signature tables or if need be could always go
01140	clear back to the input parameters. The decision as to how far to
01150	carry the initial bottom up analysis must depend upon the relative
01160	cost of this analysis, both in complexity and processing time, and
01170	the certainty with which it can be performed, as compared with the
01180	corresponding costs and certainty for the rest of the analysis,
01190	taking due notice of the costs in time of recovering from false
01200	starts.
01210	
01220		Signature tables can be used to perform four essential
01230	functions that are required in the automatic recognition of speech.
01240	These functions are: (1) the elimination of superfluous and
01250	redundant information from the acoustic input stream, (2) the
01260	transformation of the remaining information from one coordinate
01270	system to a more phonetically meaningful coordinate system, (3) the
01280	mixing of acoustically derived data with syntactic, semantic and
01290	linguistic information to obtain the desired recognition, and (4) the
01300	introduction of a learning mechanism.
01310	
01320		The following three advantages emerge from this method of
01330	training and evaluation.
01340		1) Essentially arbitrary inter-relationships between the
01350	input terms are taken into account by any one table. The only loss of
01360	accuracy is in the quantization.
01370		2) The training is a very simple process of accumulating
01380	counts. The training samples are introduced sequentially, and hence
01390	simultaneous storage of all the samples is not required.
01400		3) The process linearizes the storage requirements in the
01410	parameter space.
01420	
01430		The signature tables, as used in speech recognition, must be
01440	particularized to allow for the multi-category nature of the output.
01450	Several forms of tables have been investigated. Details of the current
01460	system are given in Appendix 2. Some results are summarized in an
01470	attached report.
01480	
01490		Work is currently under way on a major refinement of the
01500	signature table approach which adopts a somewhat more rigorous
01510	procedure. Preliminary results with this scheme indicate that a
01520	substantial improvement has been achieved.
     

01420			Appendix 1
01430	
01440		The early form of a signature table
01450	
01460		For those not familiar with the use of signature tables as
01470	used by Samuel in programs which played the game of checkers, the
01480	concept is best illustrated (Fig.1) by an arrangement of tables used
01490	in the program. There are 27 input terms. Each term evaluates a
01500	specific aspect of a board situation and it is quantized into a
01510	limited but adequate range of values, 7, 5 and 3, in this case. The
01520	terms are divided into 9 sets with 3 terms each, forming the 9 first
01530	level tables. Outputs from the first level tables are quantized to 5
01540	levels and combined into 3 second level tables and, finally, into one
01550	third-level table whose output represents the figure of merit of the
01560	board in question.
01565	
01570		A signature table has an entry for every possible combination
01580	of the input vector. Thus there are 7*5*3 or 105 entries in each of
01590	the first level tables. Training consists of accumulating two counts
01600	for each entry during a training sequence. Count A is incremented
01610	when the current input vector represents a preferred move and count D
01620	is incremented when it is not the preferred move. The output from the
01630	table is computed as a correlation coefficient
01640	 			C = (A - D)/(A + D).
01645		The figure of merit for a board is simply the
01650	coefficient obtained as the output from the final table.
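The first-level table just described can be sketched in a few lines. The class below is an illustration of the scheme, not Samuel's program; the names and the packing of the quantized input vector into an entry index are assumptions made for the example.

```python
# Sketch of the early (checkers) signature table: one first-level table
# with three quantized inputs of 7, 5 and 3 levels (7*5*3 = 105 entries).
# Each entry keeps two counts, A and D, and reports C = (A - D)/(A + D).

class SignatureTable:
    def __init__(self, ranges=(7, 5, 3)):
        self.ranges = ranges
        n = 1
        for r in ranges:
            n *= r
        self.A = [0] * n   # count: input vector belonged to a preferred move
        self.D = [0] * n   # count: it did not

    def index(self, vector):
        """Map a quantized input vector to a single table entry."""
        i = 0
        for v, r in zip(vector, self.ranges):
            i = i * r + v
        return i

    def train(self, vector, preferred):
        if preferred:
            self.A[self.index(vector)] += 1
        else:
            self.D[self.index(vector)] += 1

    def output(self, vector):
        a, d = self.A[self.index(vector)], self.D[self.index(vector)]
        return 0.0 if a + d == 0 else (a - d) / (a + d)

t = SignatureTable()
t.train((3, 2, 1), preferred=True)
t.train((3, 2, 1), preferred=True)
t.train((3, 2, 1), preferred=False)
# For this entry, C = (2 - 1)/(2 + 1) = 1/3.
```

Outputs of such first-level tables, re-quantized to 5 levels, would feed the second- and third-level tables in the same fashion.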
     

01680			Appendix 2
01690	
01700		Initial Form of Signature Table for Speech Recognition
01710	
01720		The signature tables, as used in speech recognition, must be
01730	particularized to allow for the multi-category nature of the output.
01740	Several forms of tables have been investigated. The initial form
01750	tested and used for the data presented in the attached paper uses
01760	tables consisting of two parts, a preamble and the table proper. The
01770	preamble contains: (1) space for saving a record of the current and
01780	recent output reports from the table, (2) identifying information as
01790	to the specific type of table, (3) a parameter that identifies the
01800	desired output from the table and that is used in the learning
01810	process, (4) a gating parameter specifying the input that is to be
01820	used to gate the table, (5) the sign of the gate,
01825	 (6) the gating level to be used and (7)
01830	parameters that identify the sources of the normal inputs to the
01840	table.
01850	
01860		All inputs are limited in range and specify either the
01870	absolute level of some basic property or, more usually, the probability
01880	of some property being present. These inputs may be from the original
01890	acoustic input or they may be the outputs of other tables. If from
01900	other tables they may be for the current time step or for earlier
01910	time steps (subject to practical limits as to the number of time
01920	steps that are saved).
01930	
01940		The output, or outputs, from each table are similarly limited
01950	in range and specify, in all cases, a probability that some
01960	particular significant feature, phonette, phoneme, word segment, word
01970	or phrase is present.
01980	
01990		We are limiting the range of inputs and outputs to values
02000	specified by 3 bits and the number of entries per table to 64,
02010	although this choice of values is a matter to be determined by
02020	experiment. We are also providing for any of the following input
02030	combinations: (1) one input of 6 bits, (2) two inputs of 3 bits each,
02040	(3) three inputs of 2 bits each, and (4) six inputs of 1 bit each.
02050	The uses to which these different forms are put will be described
02060	later.
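All four input combinations address the same 64-entry body, so they can share one indexing scheme. The sketch below is illustrative; the packing order of the inputs is an assumption made for the example.

```python
# Sketch: each table body has 64 entries, addressed by 6 bits of input.
# The allowed combinations all pack into the same 6-bit entry index:
# one 6-bit input, two 3-bit, three 2-bit, or six 1-bit inputs.

def pack(inputs, bits_each):
    """Pack equal-width inputs into a single 6-bit table index."""
    index = 0
    for v in inputs:
        assert 0 <= v < (1 << bits_each)
        index = (index << bits_each) | v
    return index

assert pack([5, 3], 3) == 43                 # two 3-bit inputs
assert pack([2, 1, 3], 2) == 39              # three 2-bit inputs
assert pack([1, 0, 1, 1, 0, 1], 1) == 45     # six 1-bit inputs
```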
02070	
02080		The body of each table contains entries corresponding to
02090	every possible combination of the allowed input parameters. Each
02100	entry in the table actually consists of several parts. There are
02110	fields assigned to accumulate counts of the occasions on which the
02120	specifying input values coincided with each of the different desired
02130	outputs from the table during previous learning sessions, and there
02140	are fields containing the summarized results of these learning
02150	sessions, which are used as outputs from the table.
02160	The outputs from the tables can then express to the allowed accuracy
02170	all possible functions of the input parameters.
02180	
02190	Operation in the Training Mode
02200	
02210		When operating in the training mode the program is supplied
02220	with a sequence of stored utterances with accompanying phonetic
02230	transcriptions. Each segment of the incoming speech signal is
02240	analysed (Fourier transforms or inverse filter equivalent) to obtain
02250	the necessary input parameters for the lowest level tables in the
02260	signature table hierarchy. At the same time reference is made to a
02270	table of phonetic "hints" which prescribes the desired outputs from
02280	each table for all possible phonemic inputs. The
02290	signature tables are then processed.
02300	
02310		The processing of each table is done in two steps, the first
02320	performed at each reference to the table and the second only periodically.
02330	The first process consists of locating a single entry line within the
02340	table as specified by the inputs to the table and adding a 1 to the
02350	appropriate field to indicate the presence of the property specified
02360	by the hint table as corresponding to the phoneme specified in the
02370	phonemic transcription. At this time a report is also made as to the
02380	table's output as determined from the averaged results of previous
02390	learning so that a running record may be kept of the performance of
02400	the system. At periodic intervals all tables are updated to
02410	incorporate recent learning results. To make this process easily
02420	understandable, let us restrict our attention to a table used to
02430	identify a single significant feature say Voicing. The hint table
02440	will identify whether or not the phoneme currently being processed is
02450	to be considered voiced. If it is voiced, a 1 is added to the "yes"
02460	field of the entry line located by the normal inputs to the table. If
02470	it is not voiced, a 1 is added to the "no" field. At updating time
02480	the output that this entry will subsequently report is determined by
02490	dividing the accumulated sum in the "yes" field by the sum of the
02500	numbers in the "yes" and the "no" fields, and reporting this quantity
02510	as a number in the range from 0 to 7. Actually the process is a bit
02520	more complicated than this and it varies with the exact type of table
02530	under consideration, as reported in detail in appendix B. Outputs
02540	from the signature tables are not probabilities, in the strict sense,
02550	but are the statistically-arrived-at odds based on the actual
02560	learning sequence.
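The update rule described for the Voicing table might be sketched as follows. The data layout and names are illustrative assumptions, and the sketch omits the complications mentioned above.

```python
# Sketch of the training-mode update for a single-feature table
# (Voicing): accumulate "yes"/"no" counts per entry, then at periodic
# update time report yes/(yes + no) quantized to the range 0..7.

class FeatureTable:
    def __init__(self, entries=64):
        self.yes = [0] * entries
        self.no = [0] * entries
        self.out = [0] * entries   # quantized output per entry

    def train(self, entry, voiced):
        """Add 1 to the field named by the hint for this entry line."""
        if voiced:
            self.yes[entry] += 1
        else:
            self.no[entry] += 1

    def update(self):
        """Periodic pass folding recent counts into reported outputs."""
        for i in range(len(self.out)):
            total = self.yes[i] + self.no[i]
            if total:
                self.out[i] = round(7 * self.yes[i] / total)

t = FeatureTable()
for _ in range(3):
    t.train(10, voiced=True)
t.train(10, voiced=False)
t.update()
# Entry 10 now reports round(7 * 3/4) = 5 on the 0..7 scale.
```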
02570	
02580		The preamble of the table has space for storing twelve past
02590	outputs. An input to a table can be delayed by up to that many time
02600	steps. Such a table relates the outcomes of previous events to the
02610	present hint (the learning input). A certain amount of context
02620	dependent learning is thus possible, with the limitation that the
02625	specified delays are constant.
02630	
02640		The interconnected hierarchy of tables forms a network which runs
02650	incrementally, in steps synchronous with the time window over which
02660	the input signal is analysed. The present window width is set at
02670	12.8 ms (256 points at 20 K samples/sec) with an overlap of 6.4 ms. Inputs
02680	to this network are the parameters abstracted from the frequency
02690	analyses of the signal, and the specified hint. The outputs of the
02700	network could be either the probability attached to every phonetic
02710	symbol or the output of a table associated with a feature such as
02720	voiced, vowel, etc. The point to be made is that the output generated
02730	for a segment is essentially independent of its contiguous
02740	segments. The dependency achieved by using delays in the inputs is
02750	invisible to the outputs. The outputs thus report the best estimate of
02760	what the current acoustic input is with no relation to the past
02770	outputs. Relating the successive outputs along the time dimension is
02780	realised by counters.
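The window arithmetic above can be checked in a few lines; the names are illustrative, not taken from the program.

```python
# The stated figures: 256 samples at 20,000 samples/sec is 12.8 ms,
# and a 6.4 ms overlap gives a hop of 128 samples between windows.

RATE = 20_000          # samples per second
WINDOW = 256           # samples per analysis window (12.8 ms)
HOP = WINDOW // 2      # 50% overlap -> 6.4 ms step

def window_starts(n_samples):
    """Start indices of the overlapping analysis windows."""
    return list(range(0, n_samples - WINDOW + 1, HOP))

assert abs(WINDOW / RATE - 0.0128) < 1e-12   # 12.8 ms
assert abs(HOP / RATE - 0.0064) < 1e-12      #  6.4 ms
assert window_starts(512) == [0, 128, 256]
```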
02790	
02800	The Use of COUNTERS
02810	
02820		The transition from initial segment space to event space is
02830	made possible by means of COUNTERS which are summed and reinitiated
02840	whenever their inputs cross specified threshold values, being
02850	triggered on when the input exceeds the threshold and off when it
02860	falls below. Momentary spikes are eliminated by specifying time
02870	hysteresis, the number of consecutive segments for which the input
02880	must be above the threshold. The output of a counter provides
02890	information about starting time, duration and average input for the
02900	period it was active.
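A counter of this kind might be sketched as follows. The function illustrates the described behavior, not the actual implementation; the hysteresis parameter and the data are assumptions made for the example.

```python
# Sketch of a COUNTER: it triggers on when its input stays above a
# threshold for at least `hysteresis` consecutive segments (eliminating
# momentary spikes), and off when the input falls below.  While active
# it records starting time, duration, and average input.

def run_counter(inputs, threshold, hysteresis=2):
    """Return (start, duration, average) triples for active periods."""
    periods, run, start = [], [], None
    for t, x in enumerate(inputs):
        if x > threshold:
            run.append(x)
            if start is None and len(run) >= hysteresis:
                start = t - hysteresis + 1   # on: threshold held long enough
        else:
            if start is not None:            # off: input fell below
                periods.append((start, len(run), sum(run) / len(run)))
            run, start = [], None
    if start is not None:                    # flush a period still active
        periods.append((start, len(run), sum(run) / len(run)))
    return periods

# The one-segment spike at t=1 is ignored; the sustained stretch
# from t=3 to t=5 is reported with its average input.
periods = run_counter([0, 9, 0, 8, 9, 7, 0], threshold=5)
# -> [(3, 3, 8.0)]
```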
02910	
02920		Since a counter can reference a table at any level in the
02930	hierarchy of tables, it can reflect any desired degree of information
02940	reduction. For example, a counter may be set up to show a section of
02950	speech to be a vowel, a front vowel or the vowel /I/. The counters can
02960	be looked upon as representing a mapping of parameter-time space into
02970	feature-time space or, at a higher level, symbol-time space. It may be
02980	useful to carry along the feature information as a back up in those
02990	situations where the symbolic information is not acceptable to
03000	syntactic or semantic interpretation.
03010	
03020		In the same manner as the tables, the counters run completely
03030	independently of each other. In a recognition run the counters may
03040	overlap in arbitrary fashion, may leave gaps where no counter has
03050	been triggered, or may not line up nicely. A properly segmented
03060	output, where the consecutive sections are in time sequence and are
03070	neatly labeled, is essential for further processing. This is achieved by
03080	registering the instants when the counters are triggered or
03090	terminated to form time segments called events.
03100	
03110		An event is the period between successive activation or
03120	termination of any counter. An event shorter than a specified time is
03130	merely ignored. A record of event durations and up to three active
03140	counters, ordered according to their probability, is maintained.
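The formation of events from counter on/off instants might be sketched as follows; the pooling of boundary instants and the minimum duration are illustrative assumptions.

```python
# Sketch: EVENTS are the periods between successive on/off instants of
# any counter.  Boundary instants from all counters are pooled and
# sorted, and periods shorter than a minimum duration are ignored.

def events_from_counters(boundaries, min_duration=2):
    """Form (start, end) events from pooled counter on/off instants."""
    times = sorted(set(boundaries))
    events = []
    for start, end in zip(times, times[1:]):
        if end - start >= min_duration:      # very short events are ignored
            events.append((start, end))
    return events

# Two counters: one active over 0..10, another over 4..5 and 9..14.
events = events_from_counters([0, 10, 4, 5, 9, 14])
# -> [(0, 4), (5, 9), (10, 14)] after dropping the short (4, 5) and (9, 10)
```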
03150	
03160		An event resulting from the processing described so far
03170	represents a phonette, one of the basic speech categories defined as
03180	hints in the learning process. It is only an estimate of closeness to
03190	a speech category, based on past learning. Also, each category has a
03200	more-or-less stationary spectral characterisation. Thus a category may
03210	have a phonemic equivalent, as in the case of vowels; it may be
03220	common to a phoneme class, as for the voiced or unvoiced stop gaps; or
03230	it may be subphonemic, as in a T-burst or a K-burst. The choices are
03240	based on acoustic expediency, i.e. optimisation of the learning rather
03250	than any linguistic considerations. However, a higher level
03260	interpretive program may best operate on inputs resembling phonemic
03270	transcription. The contiguous events may be coalesced into phoneme-like
03280	units using dyadic or triadic probabilities and acoustic-phonetic
03290	rules particular to the system. For example, a period of silence
03300	followed by a type of burst or a short friction may be combined to
03310	form the corresponding stop. A short friction or a burst following a
03320	nasal or a lateral may be called a stop even if the silence period is
03330	short or absent. Clearly these rules must be specific to the system,
03340	based on the confidence with which durations and phonette categories
03350	are recognised.
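A rule of the kind just described, silence followed by a burst forming a stop, might be sketched as follows. The event labels and the rule table are illustrative assumptions, not the system's actual rules.

```python
# Sketch: coalesce contiguous events into phoneme-like units with an
# acoustic-phonetic rule of the described form: a period of silence
# followed by a burst is combined into the corresponding stop.

def coalesce(events):
    """events: list of (label, duration) pairs -> phoneme-like labels."""
    out, i = [], 0
    while i < len(events):
        label, dur = events[i]
        nxt = events[i + 1][0] if i + 1 < len(events) else None
        if label == "silence" and nxt in ("t-burst", "k-burst"):
            out.append({"t-burst": "T", "k-burst": "K"}[nxt])
            i += 2                     # silence + burst -> one stop unit
        else:
            out.append(label)
            i += 1
    return out

seq = [("silence", 40), ("t-burst", 15), ("vowel-i", 90)]
units = coalesce(seq)
# -> ["T", "vowel-i"]
```

As the text notes, rules like these would have to be tuned to the confidence with which durations and phonette categories are actually recognised.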
03360